12. Loading Data into a pandas DataFrame
Loading Data into a pandas DataFrame
Pandas 7 V1
In machine learning you will most likely use databases from many sources to train your learning algorithms. Pandas allows us to load databases of different formats into DataFrames. One of the most popular data formats used to store databases is csv. CSV stands for
Comma Separated Values
and offers a simple format to store data. We can load CSV files into Pandas DataFrames using the
pd.read_csv()
function. Let's load Google stock data into a Pandas DataFrame. The GOOG.csv file contains Google stock data from 8/19/2004 till 10/13/2017 taken from Yahoo Finance.
# We load Google stock data in a DataFrame
Google_stock = pd.read_csv('./GOOG.csv')
# We print some information about Google_stock
print('Google_stock is of type:', type(Google_stock))
print('Google_stock has shape:', Google_stock.shape)
Google_stock is of type: class 'pandas.core.frame.DataFrame'
Google_stock has shape: (3313, 7)
We see that we have loaded the GOOG.csv file into a Pandas DataFrame and it consists of 3,313 rows and 7 columns. Now let's look at the stock data
Google_stock
Date Open High Low Close Adj Close Volume 0 2004-08-19 49.676899 51.693783 47.669952 49.845802 49.845802 44994500 1 2004-08-20 50.178635 54.187561 49.925285 53.805050 53.805050 23005800 2 2004-08-23 55.017166 56.373344 54.172661 54.346527 54.346527 18393200 … … 3311 2017-10-12 987.450012 994.119995 985.000000 987.830017 987.830017 1262400 3312 2017-10-13 992.000000 997.210022 989.000000 989.679993 989.679993 1157700
3313 rows × 7 columns
We see that it is quite a large dataset and that Pandas has automatically assigned numerical row indices to the DataFrame. Pandas also used the labels that appear in the data in the CSV file to assign the column labels.
When dealing with large datasets like this one, it is often useful just to take a look at the first few rows of data instead of the whole dataset. We can take a look at the first 5 rows of data using the
.head()
method, as shown below
Google_stock.head()
Date Open High Low Close Adj Close Volume 0 2004-08-19 49.676899 51.693783 47.669952 49.845802 49.845802 44994500 1 2004-08-20 50.178635 54.187561 49.925285 53.805050 53.805050 23005800 2 2004-08-23 55.017166 56.373344 54.172661 54.346527 54.346527 18393200 3 2004-08-24 55.260582 55.439419 51.450363 52.096165 52.096165 15361800 4 2004-08-25 52.140873 53.651051 51.604362 52.657513 52.657513 9257400
We can also take a look at the last 5 rows of data by using the
.tail()
method:
Google_stock.tail()
Date Open High Low Close Adj Close Volume 3308 2017-10-09 980.000000 985.424988 976.109985 977.000000 977.000000 891400 3309 2017-10-10 980.000000 981.570007 966.080017 972.599976 972.599976 968400 3310 2017-10-11 973.719971 990.710022 972.250000 989.250000 989.250000 1693300 3311 2017-10-12 987.450012 994.119995 985.000000 987.830017 987.830017 1262400 3312 2017-10-13 992.000000 997.210022 989.000000 989.679993 989.679993 1157700
We can also optionally use
.head(N)
or
.tail(N)
to display the first and last
N
rows of data, respectively.
Let's do a quick check to see whether we have any
NaN
values in our dataset. To do this, we will use the
.isnull()
method followed by the
.any()
method to check whether any of the columns contain
NaN
values.
Google_stock.isnull().any()
Date False
Open False
High False
Low False
Close False
Adj Close False
Volume False
dtype: bool
We see that we have no
NaN
values.
When dealing with large datasets, it is often useful to get statistical information from them. Pandas provides the
.describe()
method to get descriptive statistics on each column of the DataFrame. Let's see how this works:
# We get descriptive statistics on our stock data
Google_stock.describe()
Open High Low Close Adj Close Volume count 3313.000000 3313.000000 3313.000000 3313.000000 3313.000000 3.313000e+03 mean 380.186092 383.493740 376.519309 380.072458 380.072458 8.038476e+06 std 223.818650 224.974534 222.473232 223.853780 223.853780 8.399521e+06 min 49.274517 50.541279 47.669952 49.681866 49.681866 7.900000e+03 25% 226.556473 228.394516 224.003082 226.407440 226.407440 2.584900e+06 50% 293.312286 295.433502 289.929291 293.029114 293.029114 5.281300e+06 75% 536.650024 540.000000 532.409973 536.690002 536.690002 1.065370e+07 max 992.000000 997.210022 989.000000 989.679993 989.679993 8.276810e+07
If desired, we can apply the
.describe()
method on a single column as shown below:
# We get descriptive statistics on a single column of our DataFrame
Google_stock['Adj Close'].describe()
count 3313.000000
mean 380.072458
std 223.853780
min 49.681866
25% 226.407440
50% 293.029114
75% 536.690002
max 989.679993
Name: Adj Close, dtype: float64
Similarly, you can also look at one statistic by using one of the many statistical functions Pandas provides. Let's look at some examples:
# We print information about our DataFrame
print()
print('Maximum values of each column:\n', Google_stock.max())
print()
print('Minimum Close value:', Google_stock['Close'].min())
print()
print('Average value of each column:\n', Google_stock.mean())
Maximum values of each column:
Date 2017-10-13
Open 992
High 997.21
Low 989
Close 989.68
Adj Close 989.68
Volume 82768100
dtype: object
Minimum Close value: 49.681866
Average value of each column:
Open 3.801861e+02
High 3.834937e+02
Low 3.765193e+02
Close 3.800725e+02
Adj Close 3.800725e+02
Volume 8.038476e+06
dtype: float64
Another important statistical measure is data correlation. Data correlation can tell us, for example, if the data in different columns are correlated. We can use the
.corr()
method to get the correlation between different columns, as shown below:
# We display the correlation between columns
Google_stock.corr()
Open High Low Close Adj Close Volume Open 1.000000 0.999904 0.999845 0.999745 0.999745 -0.564258 High 0.999904 1.000000 0.999834 0.999868 0.999868 -0.562749 Low 0.999845 0.999834 1.000000 0.999899 0.999899 -0.567007 Close 0.999745 0.999868 0.999899 1.000000 1.000000 -0.564967 Adj Close 0.999745 0.999868 0.999899 1.000000 1.000000 -0.564967 Volume -0.564258 -0.562749 -0.567007 -0.564967 -0.564967 1.000000
A correlation value of 1 tells us there is a high correlation and a correlation of 0 tells us that the data is not correlated at all.
We will end this Introduction to Pandas by taking a look at the
.groupby()
method. The
.groupby()
method allows us to group data in different ways. Let's see how we can group data to get different types of information. For the next examples we are going to load fake data about a fictitious company.
# We load fake Company data in a DataFrame
data = pd.read_csv('./fake_company.csv')
data
Year Name Department Age Salary 0 1990 Alice HR 25 50000 1 1990 Bob RD 30 48000 2 1990 Charlie Admin 45 55000 3 1991 Alice HR 26 52000 4 1991 Bob RD 31 50000 5 1991 Charlie Admin 46 60000 6 1992 Alice Admin 27 60000 7 1992 Bob RD 32 52000 8 1992 Charlie Admin 28 62000
We see that the data contains information for the year 1990 through 1992. For each year we see name of the employees, the department they work for, their age, and their annual salary. Now, let's use the
.groupby()
method to get information.
Let's calculate how much money the company spent in salaries each year. To do this, we will group the data by
Year
using the
.groupby()
method and then we will add up the salaries of all the employees by using the
.sum()
method.
# We display the total amount of money spent in salaries each year
data.groupby(['Year'])['Salary'].sum()
Year
1990 153000
1991 162000
1992 174000
Name: Salary, dtype: int64
We see that the company spent a total of 153,000 dollars in 1990, 162,000 in 1991, and 174,000 in 1992.
Now, let's suppose I want to know what was the average salary for each year. In this case, we will group the data by
Year
using the
.groupby()
method, just as we did before, and then we use the
.mean()
method to get the average salary. Let's see how this works
# We display the average salary per year
data.groupby(['Year'])['Salary'].mean()
Year
1990 51000
1991 54000
1992 58000
Name: Salary, dtype: int64
We see that the average salary in 1990 was 51,000 dollars, 54,000 in 1991, and 58,000 in 1992.
Now let's see how much did each employee get paid in those three years. In this case, we will group the data by
Name
using the
.groupby()
method and then we will add up the salaries for each year. Let's see the result
# We display the total salary each employee received in all the years they worked for the company
data.groupby(['Name'])['Salary'].sum()
Name
Alice 162000
Bob 150000
Charlie 177000
Name: Salary, dtype: int64
We see that Alice received a total of 162,000 dollars in the three years she worked for the company, Bob received 150,000, and Charlie received 177,000.
Now let's see what was the salary distribution per department per year. In this case we will group the data by
Year
and by
Department
using the
.groupby()
method and then we will add up the salaries for each department. Let's see the result
# We display the salary distribution per department per year.
data.groupby(['Year', 'Department'])['Salary'].sum()
Year Department
1990 Admin 55000
HR 50000
RD 48000
1991 Admin 60000
HR 52000
RD 50000
1992 Admin 122000
RD 52000
Name: Salary, dtype: int64
We see that in 1990 the Admin department paid 55,000 dollars in salaries,the HR department paid 50,000, and the RD department 48,0000. While in 1992 the Admin department paid 122,000 dollars in salaries and the RD department paid 52,000.